Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Transformer architecture has shown impressive performance in multiple
research domains and has become the backbone of many neural network models.
However, there is limited understanding of how it works. In particular, with a
simple predictive loss, how the representation emerges from the gradient
\emph{training dynamics} remains a mystery. In this paper, for a 1-layer
transformer with one self-attention layer plus one decoder layer, we analyze
its SGD training dynamics for the task of next token prediction in a
mathematically rigorous manner. We open the black box of the dynamic process of
how the self-attention layer combines input tokens, and reveal the nature of
the underlying inductive bias. More specifically, under the assumptions that
(a) there is no positional encoding, (b) the input sequence is long, and (c)
the decoder layer learns faster than the self-attention layer, we prove that
self-attention acts as a
\emph{discriminative scanning algorithm}: starting from uniform attention, it
gradually attends more to distinct key tokens for a specific next token to be
predicted, and pays less attention to common key tokens that occur across
different next tokens. Among distinct tokens, it progressively drops attention
weights, following the order of low to high co-occurrence between the key and
the query token in the training set. Interestingly, this procedure does not
lead to winner-takes-all, but decelerates due to a \emph{phase transition} that
is controllable by the learning rates of the two layers, leaving an (almost)
fixed token combination. We verify this \textbf{\emph{scan and snap}} dynamics on
synthetic and real-world data (WikiText).
Comment: Fix minor issues in the proofs and figures. Update figures to reflect
the main conclusions more accurately.
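The analyzed setting can be sketched as a forward pass of the 1-layer model: one self-attention layer feeding one linear decoder, with no positional encoding, matching assumption (a). The dimensions, initializations, and softmax form below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

d, V, T = 16, 32, 10          # embedding dim, vocab size, sequence length (illustrative)
E = rng.normal(size=(V, d))   # token embeddings; no positional encoding is added
W_K = rng.normal(size=(d, d)) # key projection of the single self-attention layer
W_Q = rng.normal(size=(d, d)) # query projection
W_dec = rng.normal(size=(d, V))  # the decoder layer

def forward(tokens):
    """Next-token logits of a 1-layer transformer: one self-attention
    layer followed by one linear decoder, as in the analyzed setting."""
    X = E[tokens]                          # (T, d) input token embeddings
    q = X[-1] @ W_Q                        # query comes from the last token
    scores = (X @ W_K) @ q / np.sqrt(d)    # attention scores over key tokens
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                     # softmax attention over key tokens
    ctx = attn @ X                         # attention-weighted token combination
    return ctx @ W_dec, attn               # logits for next-token prediction

logits, attn = forward(rng.integers(0, V, size=T))
assert logits.shape == (V,) and abs(attn.sum() - 1.0) < 1e-9
```

The `attn` vector is the object whose SGD dynamics the paper tracks: over training it would shift weight from common key tokens toward distinct ones, per the scanning result above.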
LESS: Label-efficient Multi-scale Learning for Cytological Whole Slide Image Screening
In computational pathology, multiple instance learning (MIL) is widely used
to circumvent the computational impasse in giga-pixel whole slide image (WSI)
analysis. It usually consists of two stages: patch-level feature extraction and
slide-level aggregation. Recently, pretrained models or self-supervised
learning have been used to extract patch features, but they suffer from low
effectiveness or inefficiency due to overlooking the task-specific supervision
provided by slide labels. Here we propose a weakly-supervised Label-Efficient
WSI Screening method, dubbed LESS, for cytological WSI analysis with only
slide-level labels, which can be effectively applied to small datasets. First,
we suggest using variational positive-unlabeled (VPU) learning to uncover
hidden labels of both benign and malignant patches. We provide appropriate
supervision by using slide-level labels to improve the learning of patch-level
features. Next, we take into account the sparse and random arrangement of cells
in cytological WSIs. To address this, we propose a strategy to crop patches at
multiple scales and utilize a cross-attention vision transformer (CrossViT) to
combine information from different scales for WSI classification. The
combination of our two steps achieves task-alignment, improving effectiveness
and efficiency. We validate the proposed label-efficient method on a urine
cytology WSI dataset encompassing 130 samples (13,000 patches) and FNAC 2019
dataset with 212 samples (21,200 patches). Experiments show that the proposed
LESS reaches 84.79% accuracy, 85.43% AUC, 91.79% sensitivity, and 78.30%
specificity on the urine cytology WSI dataset, and 96.88% accuracy, 96.86%
AUC, 98.95% sensitivity, and 97.06% specificity on the FNAC 2019 dataset. It
outperforms state-of-the-art MIL
methods on pathology WSIs and realizes automatic cytological WSI cancer
screening.
Comment: This paper was submitted to Medical Image Analysis. It is under
review.
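The multi-scale cropping step can be sketched as below; the patch sizes, region size, and non-overlapping grid are illustrative assumptions, since the abstract does not specify the exact scales LESS uses.

```python
import numpy as np

def multiscale_patches(wsi, base=256, scales=(1, 2)):
    """Crop non-overlapping patches at multiple scales, a sketch of the
    multi-scale cropping strategy described for cytological WSIs (patch
    sizes here are illustrative, not the paper's actual settings)."""
    out = {}
    h, w = wsi.shape[:2]
    for s in scales:
        size = base * s
        out[size] = [
            wsi[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)
        ]
    return out

wsi = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a slide region
patches = multiscale_patches(wsi)
# a 1024px region yields 16 patches at 256px and 4 at 512px
assert len(patches[256]) == 16 and len(patches[512]) == 4
```

In the described pipeline, each scale's patches would then be encoded and fused by a cross-attention transformer (CrossViT) for the slide-level prediction.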
Angular Visual Hardness
Recent convolutional neural networks (CNNs) have led to impressive performance but often suffer from poor calibration. They tend to be overconfident, with the model confidence not always reflecting the underlying true ambiguity and hardness. In this paper, we propose angular visual hardness (AVH), a score given by the normalized angular distance between the sample feature embedding and the target classifier to measure sample hardness. We validate this score with an in-depth and extensive scientific study, and observe that CNN models with the highest accuracy also have the best AVH scores. This agrees with an earlier finding that state-of-the-art models improve on the classification of harder examples. We observe that the training dynamics of AVH are vastly different from those of the training loss. Specifically, AVH quickly reaches a plateau for all samples even though the training loss keeps improving. This suggests the need for designing better loss functions that can target harder examples more effectively. We also find that AVH has a statistically significant correlation with human visual hardness. Finally, we demonstrate the benefit of AVH in a variety of applications such as self-training for domain adaptation and domain generalization.
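As described, AVH scores a sample by the angle between its feature embedding and the target classifier weight, normalized over all classes. A minimal sketch, assuming the normalization is the sum of angles to all class weights (the exact form is an assumption here):

```python
import numpy as np

def avh(feature, class_weights, target):
    """Angular visual hardness sketch: angle between the sample's feature
    embedding and its target class weight, normalized by the sum of angles
    to all class weights. Larger values indicate harder samples."""
    f = feature / np.linalg.norm(feature)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    angles = np.arccos(np.clip(W @ f, -1.0, 1.0))  # angle to each classifier
    return angles[target] / angles.sum()

rng = np.random.default_rng(0)
feat = rng.normal(size=8)            # a sample's feature embedding
W = rng.normal(size=(5, 8))          # classifier weights: 5 classes, 8-dim
scores = [avh(feat, W, c) for c in range(5)]
assert abs(sum(scores) - 1.0) < 1e-9   # normalized angles sum to one
assert all(0.0 <= s <= 1.0 for s in scores)
```

Because the score depends only on angles, it is invariant to the feature and weight norms, which is what lets it plateau even while the loss keeps shrinking those norms' logits.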
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
Large Language Models (LLMs), armed with billions of parameters, exhibit
exceptional performance across a wide range of Natural Language Processing
(NLP) tasks. However, they present a significant computational challenge during
inference, especially when deploying on common hardware such as single GPUs. As
such, minimizing the latency of LLM inference by curtailing computational and
memory requirements, typically through model compression, becomes critically
important. However, compression inevitably introduces a trade-off between
efficiency and accuracy, as compressed LLMs typically lose some predictive
precision. In this research, we introduce a new perspective:
to optimize this trade-off, compressed LLMs require a unique input format that
varies from that of the original models. Our findings indicate that the
generation quality in a compressed LLM can be markedly improved for specific
queries by selecting prompts with precision. Capitalizing on this insight, we
introduce a prompt learning paradigm that cultivates an additive prompt over a
compressed LLM to bolster its accuracy. Our empirical results imply that
through our strategic prompt utilization, compressed LLMs can match, and
occasionally even exceed, the accuracy of the original models. Moreover, we
demonstrate that these learned prompts have a certain degree of
transferability across various datasets, tasks, and compression levels. These
insights shine a light on new possibilities for enhancing the balance between
accuracy and efficiency in LLM inference. Specifically, they underscore the
importance of judicious input editing to a compressed large model, hinting at
potential advancements in scaling LLMs on common hardware.
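One way to realize an additive, transferable prompt is a small set of learned embedding vectors prepended to the input of the frozen compressed model. The dimensions and the prepend-style mechanism below are assumptions for illustration, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prompt, T = 64, 8, 20   # embedding dim, prompt length, query length (illustrative)

# Learnable soft-prompt vectors trained for the *compressed* model;
# the compressed LLM's weights stay frozen, and only these vectors
# would be updated by the prompt-learning objective.
soft_prompt = rng.normal(scale=0.02, size=(n_prompt, d))

def prepend_prompt(token_embeds):
    """Prepend the learned prompt to the query's token embeddings before
    the frozen, compressed LLM consumes them -- a sketch of the additive
    prompt idea, not the paper's exact implementation."""
    return np.concatenate([soft_prompt, token_embeds], axis=0)

x = rng.normal(size=(T, d))           # embeddings of a user query
augmented = prepend_prompt(x)
assert augmented.shape == (n_prompt + T, d)
```

Because the prompt lives in embedding space rather than in the model weights, the same learned vectors can be reused across datasets, tasks, and compression levels, which is the transferability property the abstract reports.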